Abstract: Distributed storage systems provide reliable access
to data through redundancy spread over individually unreliable 
nodes. Application scenarios include data centers, peer-to-peer 
storage systems, and storage in wireless networks. Storing data 
using an erasure code, in fragments spread across nodes, requires 
less redundancy than simple replication for the same level 
of reliability. However, since fragments must be periodically 
replaced as nodes fail, a key question is how to generate encoded 
fragments in a distributed way while transferring as little data 
as possible across the network. 

For an erasure coded system, a common practice to repair 
from a node failure is for a new node to download subsets of 
data stored at a number of surviving nodes, reconstruct a lost 
coded block using the downloaded data, and store it at the new 
node. We show that this procedure is sub-optimal. We introduce 
the notion of regenerating codes, which allow a new node to 
download functions of the stored data from the surviving nodes. 
We show that regenerating codes can significantly reduce the 
repair bandwidth. Further, we show that there is a fundamental 
tradeoff between storage and repair bandwidth which we theo- 
retically characterize using flow arguments on an appropriately 
constructed graph. By invoking constructive results in network 
coding, we introduce regenerating codes that can achieve any 
point in this optimal tradeoff. 



I. Introduction 

The purpose of distributed storage systems is to store data
reliably over long periods of time using a distributed collection
of storage nodes which may be individually unreliable.
Applications involve storage in large data centers and peer-to-peer
storage systems such as OceanStore [3], Total Recall [4], and
DHash++ [5], that use nodes across the Internet for distributed
file storage. In wireless sensor networks, obtaining reliable
storage over unreliable motes might be desirable for robust
data recovery [6], especially in catastrophic scenarios [7].

In all these scenarios, ensuring reliability requires the in- 
troduction of redundancy. The simplest form of redundancy 
is replication, which is adopted in many practical storage 
systems. As a generalization of replication, erasure coding 
offers better storage efficiency. For instance, we can divide 
a file of size M into k pieces, each of size M/k, encode
them into n coded pieces using an (n, k) maximum distance 
separable (MDS) code, and store them at n nodes. Then, the 
original file can be recovered from any set of k coded pieces. 
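
The "any k of n" property can be illustrated with a toy Reed-Solomon-style code. The sketch below is only an illustration under stated assumptions: each piece is a single symbol of the prime field GF(257), and the names `encode` and `decode` are hypothetical helpers, not part of any system discussed here. The k pieces are used as polynomial coefficients, and the n coded symbols are evaluations of that polynomial at n distinct points.

```python
from itertools import combinations

P = 257  # small prime field, chosen only for illustration

def encode(pieces, n, p=P):
    """Treat the k pieces as polynomial coefficients and evaluate at x = 1..n."""
    return [(x, sum(c * pow(x, e, p) for e, c in enumerate(pieces)) % p)
            for x in range(1, n + 1)]

def decode(shares, k, p=P):
    """Recover the k pieces from any k coded symbols (Gauss-Jordan mod p)."""
    rows = [[pow(x, e, p) for e in range(k)] + [y] for x, y in shares[:k]]
    for col in range(k):
        # A Vandermonde system with distinct x values always has a pivot.
        pivot = next(r for r in range(col, k) if rows[r][col])
        rows[col], rows[pivot] = rows[pivot], rows[col]
        inv = pow(rows[col][col], p - 2, p)  # inverse via Fermat's little theorem
        rows[col] = [v * inv % p for v in rows[col]]
        for r in range(k):
            if r != col and rows[r][col]:
                factor = rows[r][col]
                rows[r] = [(v - factor * w) % p
                           for v, w in zip(rows[r], rows[col])]
    return [row[k] for row in rows]

pieces = [42, 7]               # k = 2 source symbols
coded = encode(pieces, n=4)    # n = 4 coded symbols, one per storage node
recovered = [decode(list(pair), k=2) for pair in combinations(coded, 2)]
print(all(r == pieces for r in recovered))
```

Every one of the six pairs of coded symbols yields the original two pieces, which is the MDS property for the (4, 2) case.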

Results in this paper have appeared in part in [1] and [2].




Fig. 1. The repair problem: Assume that a (4,2) MDS erasure code is used
to generate 4 fragments (stored in nodes x_1, ..., x_4) with the property that
any 2 can be used to reconstruct the original data y_1, y_2. When node x_4 fails,
and a newcomer x_5 needs to generate an erasure fragment from x_1, ..., x_3,
what is the minimum amount of information that needs to be communicated?



This performance is optimal in terms of the redundancy-reliability
tradeoff because k pieces, each of size M/k,
provide the minimum data for recovering the file, which is of
size M. Several designs [?], [?], [?] use erasure codes instead
of replication. For certain cases, erasure coding can achieve
orders of magnitude higher reliability for the same redundancy
factor compared to replication; see, e.g., [9].

However, a complication arises: In distributed storage systems,
redundancy must be continually refreshed as nodes fail
or leave the system, which involves large data transfers across
the network. This problem is best illustrated in the simple
example of Fig. 1: a data object is divided into two fragments
y_1, y_2 (say, each of size 1Mb) and these are encoded into four
fragments x_1, ..., x_4 of the same size, with the property that any
two out of the four can be used to recover the original y_1, y_2.
Now assume that storage node x_4 fails and a new node x_5,
the newcomer, needs to communicate with existing nodes
and create a new encoded packet, such that any two out
of x_1, x_2, x_3, x_5 suffice to recover. Clearly, if the newcomer
can download any two encoded fragments (say from x_1, x_2),
reconstruction of the whole data object is possible and then
a new encoded fragment can be generated (for example by
making a new linear combination that is independent from the
existing ones). This, however, requires the communication of
2Mb in the network to generate an erasure encoded fragment
of size 1Mb at x_5. In general, if an object of size M is divided
into k initial fragments, the repair bandwidth with this strategy
is M bits to generate a fragment of size M/k. In contrast, if
replication is used instead, a new replica may simply be copied
from any other existing node, incurring no bandwidth overhead.
It was commonly believed that this k-factor overhead
in repair bandwidth is an unavoidable cost that comes
with the benefits of coding (see, for example, [10]). Indeed,
all known coding constructions require access to the original
data object to generate encoded fragments.

Fig. 2. Example: A repair for a (4,2)-Minimum-Storage Regenerating Code. All the packets (boxes) in this figure have size 0.5Mb and each node stores two
packets. Note that any two nodes have four equations that can be used to recover the data, a_1, a_2, b_1, b_2. The parity packets p_1, p_2, p_3 are used to create the
two packets of the newcomer, requiring repair bandwidth of 1.5Mb. The multiplying coefficients are selected at random and the example is shown over the
integers for simplicity (although any sufficiently large field would be enough). The key point is that nodes do not send their information but generate smaller
parity packets of their data and forward them to the newcomer, who further mixes them to generate two new packets. Note that the selected coefficients also
need to be included in the packets, which introduces some overhead.

In this paper we show that, surprisingly, there exist erasure 
codes that can be repaired without communicating the whole 
data object. In particular, for the (4, 2) example, we show that 
the newcomer can download 1.5Mb to repair a failure and 
that this is the information theoretic minimum (see Fig. 2 for
an example). More generally, we identify a tradeoff between 
storage and repair bandwidth and show that codes exist that 
achieve every point on this optimal tradeoff curve. We call 
codes that lie on this optimal tradeoff curve regenerating 
codes. Note that the tradeoff region computed corrects an error
in the threshold α_c computed in [1] and generalizes the result
to every feasible (α, γ) pair.
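
The 1.5Mb repair of the (4, 2) example can be simulated by tracking only the coefficient vectors of the packets. The following is a minimal sketch under several assumptions that are not taken from the paper: arithmetic is done in the prime field GF(2^31 - 1) rather than over the integers, the initial contents of nodes 3 and 4 are one hypothetical choice that satisfies the "any two nodes recover" property, and `rank_mod_p`, `combine`, `any_two_recover`, and `repair` are illustrative helper names. Each packet is represented by its coefficients over (a_1, a_2, b_1, b_2).

```python
import random
from itertools import combinations

P = 2**31 - 1  # a large prime; the field choice is an assumption of this sketch

def rank_mod_p(rows, p=P):
    """Rank of a matrix over GF(p) by Gaussian elimination."""
    rows = [list(r) for r in rows]
    rank = 0
    for col in range(len(rows[0])):
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col]), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        inv = pow(rows[rank][col], p - 2, p)
        rows[rank] = [v * inv % p for v in rows[rank]]
        for r in range(len(rows)):
            if r != rank and rows[r][col]:
                factor = rows[r][col]
                rows[r] = [(v - factor * w) % p
                           for v, w in zip(rows[r], rows[rank])]
        rank += 1
    return rank

def combine(packets, coeffs, p=P):
    """Linear combination of packets, each a coefficient vector."""
    return [sum(c * pkt[j] for c, pkt in zip(coeffs, packets)) % p
            for j in range(len(packets[0]))]

def any_two_recover(nodes):
    """True if every pair of nodes holds four independent equations."""
    return all(rank_mod_p(nodes[i] + nodes[j]) == 4
               for i, j in combinations(nodes, 2))

def repair(survivors, rng, max_tries=100):
    """Each survivor sends ONE mixed packet; the newcomer mixes them into two."""
    for _ in range(max_tries):
        parities = [combine(pkts, [rng.randrange(1, P) for _ in pkts])
                    for pkts in survivors.values()]
        newcomer = [combine(parities, [rng.randrange(1, P) for _ in parities])
                    for _ in range(2)]
        if any_two_recover({**survivors, 5: newcomer}):
            return parities, newcomer
    raise RuntimeError("no valid coefficients found")

# Hypothetical initial code: two packets per node over (a1, a2, b1, b2).
nodes = {
    1: [[1, 0, 0, 0], [0, 1, 0, 0]],
    2: [[0, 0, 1, 0], [0, 0, 0, 1]],
    3: [[1, 0, 1, 0], [0, 1, 0, 1]],
    4: [[1, 0, 2, 0], [0, 1, 0, 2]],
}
survivors = {i: nodes[i] for i in (1, 2, 3)}  # node 4 has failed
parities, newcomer = repair(survivors, random.Random(0))
print(len(parities), "packets downloaded by the newcomer")
```

With 0.5Mb packets, the three downloaded parity packets amount to 1.5Mb, compared with the four packets (2Mb) needed to rebuild the whole object first. The retry loop only guards against the rare unlucky draw of coefficients.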

The two extremal points on the tradeoff curve are of special 
interest and we refer to them as minimum-storage regenerating 
(MSR) codes and minimum-bandwidth regenerating (MBR) 
codes. The former correspond to Maximum Distance Sepa- 
rable (MDS) codes that can also be efficiently repaired. At 
the other end of the tradeoff are the MBR codes, which have 
minimum repair bandwidth. We show that if each storage node 
is allowed to store slightly more than M/k bits, the repair
bandwidth can be significantly reduced. 

The remainder of this paper is organized as follows. In
Section II we discuss relevant background and related work
from network coding theory and distributed storage systems.
In Section III we introduce the notion of the information flow
graph, which represents how information is communicated and
stored in the network as nodes join and leave the system. In
Section III-B we characterize the minimum storage and repair
bandwidth and show that there is a tradeoff between these
two quantities that can be expressed in terms of a maximum
flow on this graph. We further show that for any finite
information flow graph, there exists a regenerating code that can
achieve any point on the minimum storage/bandwidth feasible
region we computed. Finally, in Section IV we evaluate the
performance of the proposed regenerating codes using traces
of failures in real systems and compare to alternative schemes
previously proposed in the distributed storage literature.

II. Background and Related Work 
A. Erasure codes 

Classical coding theory focuses on the tradeoff between 
redundancy and error tolerance. In terms of the redundancy- 
reliability tradeoff, the Maximum Distance Separable (MDS) 
codes are optimal. The most well-known class of MDS erasure 
codes is the Reed-Solomon code. More recent studies on erasure
coding focus on other performance metrics. For instance,
sparse graph codes [11], [12], [13] can achieve near-optimal
performance in terms of the redundancy-reliability tradeoff and
also require low encoding and decoding complexity. Another
line of research for erasure coding in storage applications
is parity array codes; see, e.g., [14], [15], [16], [17]. The
array codes are based solely on XOR operations and they
are generally designed with the objective of low encoding,
decoding, and update complexities. Plank [18] gave a tutorial
on erasure codes for storage applications at USENIX FAST
2005, which covers Reed-Solomon codes, parity-array codes,
and LDPC codes.

Compared to these studies, this paper focuses on different
performance metrics. Specifically, motivated by practical
concerns in large distributed storage systems, we explore
erasure codes that offer good tradeoffs among redundancy,
reliability, and repair bandwidth.

B. Network Coding 

Network coding is a generalization of the conventional routing
(store-and-forward) method. In conventional routing,
each intermediate node in the network simply stores and forwards
information received. In contrast, network coding allows
the intermediate nodes to generate output data by encoding 
(i.e., computing certain functions of) previously received input 
data. Thus, network coding allows information to be "mixed" 
at intermediate nodes. The potential advantages of network 
coding over routing include resource (e.g., bandwidth and 
power) efficiency, computational efficiency, and robustness 
to network dynamics. As shown by the pioneering work of 
Ahlswede et al. [19], network coding can increase the possible
network throughput, and in the multicast case can achieve the 
maximum data rate theoretically possible. 

Subsequent work [20], [21] showed that the maximum
multicast capacity can be achieved by using linear encoding
functions at each node. The studies by Ho et al. [22] and
Sanders et al. [23] further showed that random linear network
coding over a sufficiently large finite field can (asymptotically)
achieve the multicast capacity. A polynomial complexity
procedure to construct deterministic network codes that achieve
the multicast capacity is given by Jaggi et al. [24].

For distributed storage, the idea of using network coding
was introduced in [6] for wireless sensor networks. Many
aspects of coding for storage were further explored in [7], [25],
[26] for sensor network applications. Network coding was
proposed for peer-to-peer content distribution systems [27]
where random linear operations over packets are performed
to improve file downloading in large unstructured overlay
networks.

The key difference between this paper and the existing literature
is that we bring the dimension of repair bandwidth into the 
picture, and present fundamental bounds and constructions for 
network codes that need to be maintained over time. Similar to 
this related work, intermediate nodes form linear combinations 
in a finite field and the combination coefficients are also 
stored in each packet, creating some overhead that can be 
made arbitrarily small for larger packet sizes. In regenerating 
codes, repair bandwidth is reduced because many nodes create 
small parity packets of their data that essentially contain 
enough novel information to generate a new encoded fragment, 
without requiring reconstruction of the whole data object.

C. Distributed storage systems 

A number of recent studies [?], [?], [?], [?], [?], [?]
have designed and evaluated large-scale, peer-to-peer distributed
storage systems. Redundancy management strategies
for such systems have been evaluated in [9], [32], [?], [10],
[?], [?], [?], [?].

Among these, [9], [4], [10] compared replication with
erasure codes in the bandwidth-reliability tradeoff space. The
analysis of Weatherspoon and Kubiatowicz [9] showed that
erasure codes could reduce bandwidth use by an order of
magnitude compared with replication. Bhagwan et al. [4] came
to a similar conclusion in a simulation of the Total Recall
storage system.

Rodrigues and Liskov [10] propose a solution to the repair
problem that we call the Hybrid strategy: one special storage
node maintains one full replica in addition to multiple
erasure-coded fragments. The node storing the replica can produce
new fragments and send them to newcomers, thus transferring
just M/k bytes for a new fragment. However, maintaining an
extra replica on one node dilutes the bandwidth-efficiency of
erasure codes and complicates system design. For example, if
the replica is lost, new fragments cannot be created until it is
restored. The authors show that in high-churn environments
(i.e., high rate of node joins/leaves), erasure codes provide
a large storage benefit, but the bandwidth cost is too high
to be practical for a P2P distributed storage system when
the Hybrid strategy is used. In low-churn environments, the reduction
in bandwidth is negligible. In moderate-churn environments,
there is some benefit, but this may be outweighed by the
added architectural complexity that erasure codes introduce,
as discussed further in Section IV-E. These conclusions were
based on an analytical model augmented with parameters
estimated from traces of real systems. Compared with [5], [10]
used a much smaller value of k (7 instead of 32) and the
Hybrid strategy to address the code regeneration problem. In
Section IV we follow the evaluation methodology of [10] to
measure the performance of the two redundancy maintenance
schemes that we introduce.

III. Analysis 

Our analysis is based on a particular graphical represen- 
tation of a distributed storage system, which we refer to as 
an information flow graph G. This graph describes how the
information of the data object is communicated through the 
network, stored in nodes with limited memory, and reaches 
reconstruction points at the data collectors. 

A. Information Flow Graph 

The information flow graph is a directed acyclic graph
consisting of three kinds of nodes: a single data source S,
storage nodes x^i_in, x^i_out, and data collectors DC_i. The single
node S corresponds to the source of the original data. Storage
node i in the system is represented by a storage input node
x^i_in and a storage output node x^i_out; these two nodes are
connected by a directed edge x^i_in -> x^i_out with capacity equal
to the amount of data stored at node i. See Figure 3 for an
illustration.

Given the dynamic nature of the storage systems that we
consider, the information flow graph also evolves in time. At
any given time, each vertex in the graph is either active or
inactive, depending on whether it is available in the network.
At the initial time, only the source node S is active; it then
contacts an initial set of storage nodes, and connects to their
inputs (x^i_in) with directed edges of infinite capacity. From
this point onwards, the original source node S becomes and
remains inactive. At the next time step, the initially chosen
storage nodes now become active; they represent a distributed
erasure code, corresponding to the desired steady state of the
system. If a new node j joins the system, it can only be
connected with active nodes. If the newcomer j chooses to
connect with active storage node i, then we add a directed edge
from x^i_out to x^j_in, with capacity equal to the amount of data that
the newcomer downloads from node i. Note that in general it
is possible for nodes to download more data than they store, as
in the example of the (4, 2)-erasure code. If a node leaves the
system, it becomes inactive. Finally, a data collector DC is a
node that corresponds to a request to reconstruct the data. Data
collectors connect to subsets of active nodes through edges
with infinite capacity.

Fig. 3. Illustration of the information flow graph G corresponding to the
(4,2) code of Fig. 1. A distributed storage scheme uses a (4,2) erasure
code in which any 2 fragments suffice to recover the original data. If node x_4
becomes unavailable and a new node joins the system, we need to construct
a new encoded fragment in x_5. To do so, node x^5_in is connected to the d = 3
active storage nodes. Assuming β bits are communicated from each active storage
node, of interest is the minimum β required. The min-cut separating the source
and the data collector must be at least M = 2Mb for reconstruction to
be possible. For this graph, the min-cut value is given by 1 + 2β, implying
that β ≥ 0.5Mb is sufficient and necessary.

An important notion associated with the information flow
graph is that of minimum cuts: A cut in the graph G between
the source S and a fixed data collector node DC is a subset C
of edges such that there is no path from S to DC that
does not have one or more edges in C. The minimum cut is
the cut between S and DC in which the total sum of the edge
capacities is smallest.
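
The min-cut value quoted for the (4, 2) example can be checked numerically. The sketch below builds the information flow graph after node 4 has failed and a newcomer (node 5) has contacted nodes 1, 2, 3, then computes the maximum flow to a data collector that reads nodes 1 and 5. It is an illustration under stated assumptions: capacities are integers in units of 0.1Mb (so α = 10 means 1Mb), infinite capacity is approximated by a large constant, and `repair_graph` and `max_flow` are hypothetical helper names; the flow routine is a plain Edmonds-Karp implementation.

```python
from collections import deque

INF = 10**9  # stands in for infinite capacity

def repair_graph(alpha, beta, helpers=(1, 2, 3), collector_reads=(1, 5)):
    """Information flow graph of the (4,2) example after node 4 fails."""
    cap = {"S": {}}
    for i in range(1, 5):
        cap["S"][f"in{i}"] = INF              # source feeds the initial nodes
        cap[f"in{i}"] = {f"out{i}": alpha}    # storage limit of node i
        cap[f"out{i}"] = {}
    cap["in5"] = {"out5": alpha}              # the newcomer stores alpha too
    cap["out5"] = {}
    for i in helpers:
        cap[f"out{i}"]["in5"] = beta          # newcomer downloads beta from each
    for i in collector_reads:
        cap[f"out{i}"]["DC"] = INF            # data collector reads these nodes
    return cap

def max_flow(cap, s, t):
    """Edmonds-Karp maximum flow on a dict-of-dicts capacity map."""
    res = {u: dict(vs) for u, vs in cap.items()}
    for u, vs in cap.items():
        for v in vs:
            res.setdefault(v, {}).setdefault(u, 0)  # reverse residual edges
    flow = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow
        bottleneck, v = INF, t
        while parent[v] is not None:
            bottleneck = min(bottleneck, res[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:
            res[parent[v]][v] -= bottleneck
            res[v][parent[v]] += bottleneck
            v = parent[v]
        flow += bottleneck

print(max_flow(repair_graph(alpha=10, beta=5), "S", "DC"))  # beta = 0.5Mb
print(max_flow(repair_graph(alpha=10, beta=4), "S", "DC"))  # beta = 0.4Mb
```

With β = 0.5Mb the flow reaches 20 units (2Mb), the full file size, while β = 0.4Mb yields only 18 units, so the collector could not reconstruct the data.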

B. Storage-Bandwidth Tradeoff 

We are now ready for the main result of this paper, 
the characterization of the feasible storage-repair bandwidth 
points. The setup is as follows: The normal redundancy we 
want to maintain requires n active storage nodes, each storing 
α bits. Whenever a node fails, a newcomer downloads β bits
each from any d surviving nodes. Therefore the total repair
bandwidth is γ = dβ (see Figure 3). We restrict our attention
to the symmetric setup where it is required that any k storage 
nodes can recover the original file, and a newcomer downloads 
the same amount of information from each of the existing 
nodes. 

For each set of parameters (n, k, d, α, γ), there is a family
of information flow graphs, each of which corresponds to
a particular evolution of node failures/repairs. We denote
this family of directed acyclic graphs by G(n, k, d, α, γ). An
(n, k, d, α, γ) tuple will be feasible if a code with storage α
and repair bandwidth γ exists. For the example in Figure 3,
the point (4, 2, 3, 1Mb, 1.5Mb) is feasible (a code that
achieves it is shown in Figure 2) and also lies on the optimal
tradeoff, whereas a standard erasure code which communicates
the whole data object would correspond to γ = 2Mb instead.
Note that n, k, d must be integers while α, β, γ are real valued.

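Theorem 1: For any α ≥ α*(d, γ), the points (n, k, d, α, γ)
are feasible, and linear network codes suffice to achieve them.
It is information theoretically impossible to achieve points with
α < α*(d, γ). The threshold function α*(d, γ) (which also
depends on n, k) is the following:

$$
\alpha^*(d, \gamma) =
\begin{cases}
\dfrac{M}{k}, & \gamma \in [f(0), +\infty), \\[1ex]
\dfrac{M - g(i)\,\gamma}{k - i}, & \gamma \in [f(i), f(i-1)), \quad i = 1, \ldots, k-1,
\end{cases}
\tag{1}
$$

where

$$
f(i) \triangleq \frac{2Md}{(2k - i - 1)\,i + 2k(d - k + 1)}, \tag{2}
$$

$$
g(i) \triangleq \frac{(2d - 2k + i + 1)\,i}{2d}. \tag{3}
$$

The minimum γ is

$$
\gamma_{\min} = f(k - 1) = \frac{2Md}{2kd - k^2 + k}. \tag{4}
$$
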

The complete proof of this theorem is given in the Ap- 
pendix. The main idea is that the code repair problem can 
be mapped to a multicasting problem on the information flow 
graph. Known results on network coding for multicasting can 
then be used to establish that code repair can be achieved if 
and only if the underlying information flow graph has enough 
connectivity. The bulk of the technical analysis of the proof 
then involves computing the minimum cuts on arbitrary graphs
in G(n, k, d, α, γ) and solving an optimization problem for
minimizing α subject to a sufficient flow constraint.

The optimal tradeoff curves for k = 5, n = 10, d = 9 and
k = 10, n = 15, d = 14 are shown in Figure 4 (top) and
(bottom), respectively.
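
Points on these tradeoff curves can be computed directly from the threshold function of Theorem 1. The following is a minimal sketch, assuming exact rational arithmetic via Python's `fractions` module; the function name `threshold` is a hypothetical helper, and the formulas for f(i) and g(i) are the ones stated in the theorem.

```python
from fractions import Fraction

def threshold(n, k, d, M, gamma):
    """Minimum per-node storage for repair bandwidth gamma (Theorem 1).

    Returns None when gamma is below the minimum feasible bandwidth f(k - 1).
    """
    assert k <= d <= n - 1, "a newcomer contacts between k and n - 1 nodes"

    def f(i):
        return Fraction(2 * M * d, (2 * k - i - 1) * i + 2 * k * (d - k + 1))

    def g(i):
        return Fraction((2 * d - 2 * k + i + 1) * i, 2 * d)

    gamma = Fraction(gamma)
    if gamma >= f(0):
        return Fraction(M, k)                  # minimum-storage regime
    for i in range(1, k):
        if f(i) <= gamma < f(i - 1):
            return (M - g(i) * gamma) / (k - i)
    return None                                # infeasible: gamma < f(k - 1)

# The k = 5, n = 10, d = 9 curve with M = 1.
for gamma in (1, Fraction(3, 10), Fraction(9, 35), Fraction(1, 4)):
    print(gamma, threshold(10, 5, 9, 1, gamma))
```

For γ = 1 the threshold is 1/5, the traditional erasure-coding point; γ = 9/35 is the minimum bandwidth, where storage equals bandwidth; and γ = 1/4 falls below the minimum, so no code exists. Passing `Fraction` values rather than floats keeps the interval comparisons exact.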

C. Special Cases: Minimum-Storage Regenerating (MSR) 
Codes and Minimum-Bandwidth Regenerating (MBR) Codes 

We now study two extremal points on the optimal tradeoff 
curve, which correspond to the best storage efficiency and 
the minimum repair bandwidth, respectively. We call codes 
that attain these points minimum-storage regenerating (MSR) 
codes and minimum-bandwidth regenerating (MBR) codes, 
respectively. 

It can be verified that the minimum storage point is achieved
by the pair

$$
(\alpha_{MSR}, \gamma_{MSR}) = \left( \frac{M}{k}, \; \frac{Md}{k(d - k + 1)} \right). \tag{5}
$$


If we substitute d = k into the above, we note that the total
network bandwidth for repair is M, the size of the original
file. Therefore, if we only allow a newcomer to contact k
nodes, it is optimal to download the whole file and then
compute the new fragment. However, if we allow a newcomer
to contact more than k nodes, the network bandwidth γ_MSR
can be reduced significantly. The minimum network bandwidth
is clearly achieved by having the newcomer contact all other
nodes. For instance, for (n, k) = (14, 7), the newcomer needs
to download only M/49 from each of the d = n - 1 = 13 active
storage nodes, making the repair bandwidth equal to 13M/49,
required to generate a fragment of size M/7.

Fig. 4. Optimal tradeoff curve between storage α and repair bandwidth γ, for k = 5, n = 10 (top) and k = 10, n = 15 (bottom). For both plots M = 1 and
d = n - 1. Note that traditional erasure coding corresponds to the points (γ = 1, α = 0.2) and (γ = 1, α = 0.1) for the top and bottom plots.

Since the MSR codes store M/k bits at each node while
ensuring any k coded blocks can be used to recover the original
file, the MSR codes have reliability-redundancy performance
equivalent to that of standard Maximum Distance Separable (MDS)
codes. However, MSR codes outperform classical MDS codes
in terms of the network repair bandwidth.

At the other end of the tradeoff are MBR codes, which 
have minimum repair bandwidth. It can be verified that the 
minimum repair bandwidth point is achieved by 

$$
(\alpha_{MBR}, \gamma_{MBR}) = \left( \frac{2Md}{2kd - k^2 + k}, \; \frac{2Md}{2kd - k^2 + k} \right). \tag{6}
$$

Note that in minimum-bandwidth regenerating codes, the storage
size α is equal to γ, the total number of bits downloaded.
Therefore MBR codes incur no bandwidth expansion at all,
just like a replication system does. However, the benefit of
MBR codes is significantly better storage efficiency.
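
The two extremal points are easy to evaluate for concrete parameters. The sketch below simply codes equations (5) and (6) with exact fractions; `msr_point` and `mbr_point` are hypothetical helper names, and M is taken as an integer file size in arbitrary units.

```python
from fractions import Fraction

def msr_point(k, d, M):
    """(alpha, gamma) at the minimum-storage point, equation (5)."""
    return Fraction(M, k), Fraction(M * d, k * (d - k + 1))

def mbr_point(k, d, M):
    """(alpha, gamma) at the minimum-bandwidth point, equation (6)."""
    value = Fraction(2 * M * d, 2 * k * d - k * k + k)
    return value, value

# The (4, 2) example with a 2Mb file and d = 3 helpers.
print(msr_point(k=2, d=3, M=2))   # storage 1Mb, repair bandwidth 3/2 Mb
print(mbr_point(k=2, d=3, M=2))   # storage and bandwidth both 6/5 Mb

# Contacting only d = k nodes forces a download of the whole file.
print(msr_point(k=2, d=2, M=2))

# The (14, 7) example with M = 1: gamma = 13/49, so each of the
# 13 helpers sends gamma / d = 1/49 of the file.
alpha, gamma = msr_point(k=7, d=13, M=1)
print(alpha, gamma, gamma / 13)
```

For the (4, 2) code the MBR point stores 1.2Mb per node instead of 1Mb, and in exchange lowers the repair traffic from 1.5Mb to 1.2Mb.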

IV. Evaluation 

In this section, we compare regenerating codes with other
redundancy management schemes in the context of distributed
storage systems. We follow the evaluation methodology
of [10], which consists of a simple analytical model
whose parameters are obtained from traces of node availability
measured in several real distributed systems.

We begin in Section IV-A with a discussion of node dynamics
and the objectives relevant to distributed storage systems,
namely reliability, bandwidth, and disk space. We introduce
the model in Section IV-B and estimate realistic values for
its parameters in Section IV-C. Section IV-D contains the
quantitative results of our evaluation. In Section IV-E we
discuss qualitative tradeoffs between regenerating codes and
other strategies, and how our results change the conclusion
of [10] that erasure codes provide limited practical benefit.

A. Node dynamics and objectives 

In this section we introduce some background and terminology
which is common to most of the work discussed in
Section II-C.

We draw a distinction between permanent and transient 
node failures. A permanent failure, such as the permanent 
departure of a node from the system or a disk failure, results 
in loss of the data stored on the node. In contrast, data 
is preserved across a transient failure, such as a reboot or 
temporary network disconnection. We say that a node is 
available when its data can be retrieved across the network. 

Distributed storage systems attempt to provide two types
of reliability: availability and durability. A file is available
when it can be reconstructed from the data stored on currently
available nodes. A file's durability is maintained if it has
not been lost due to permanent node failures: that is, it may
be available at some point in the future. Both properties are
desirable, but in this paper we report results for availability
only. Specifically, we will show file unavailability, the fraction
of time that the file is not available.

B. Model 

We use a model which is intended to capture the average-case
bandwidth used to maintain a file in the system, and
the resulting average availability of the file. With minor
exceptions,¹ this model and the subsequent estimation of its
parameters are equivalent to those of [10]. Although this
evaluation methodology is a significant simplification of real storage
systems, it allows us to compare directly with the conclusions
of [10] as well as to calculate precise values for rare events.

The model has two key parameters, f and a. First, we
assume that in expectation a fraction f of the nodes storing
file data fail permanently per unit time, causing data transfers
to repair the lost redundancy. Second, we assume that at any
given time while a node is storing data, the node is available
with some probability a (and with probability 1 - a is currently
experiencing a transient failure). Moreover, the model assumes
that the event that a node is available is independent of the
availability of all other nodes.

Under these assumptions, we can compute the expected
availability and maintenance bandwidth of various redundancy
schemes to maintain a file of M bytes. We make use of the
fact that for all schemes except MSR codes, the amount of
bandwidth used is equal to the amount of redundancy that had
to be replaced, which is in expectation f times the amount of
storage used.

Replication: If we store R replicas of the file, then we store
a total of R · M bytes, and in expectation we must replace
f · R · M bytes per unit time. The file is unavailable if no
replica is available, which happens with probability (1 - a)^R.

Ideal Erasure Codes: For comparison, we show the bandwidth
and availability of a hypothetical (n, k) erasure code
strategy which can "magically" create a new packet while
transferring just M/k bytes (i.e., the size of the packet).
Setting n = k · R, this strategy sends f · R · M bytes per
unit time and has unavailability probability

$$
U_{ideal}(n, k) := \sum_{i=0}^{k-1} \binom{n}{i} a^i (1 - a)^{n-i}.
$$

Hybrid: If we store one full replica plus an (n, k) erasure
code where n = k · (R - 1), then we again store R · M
bytes in total, so we transfer f · R · M bytes per unit time in
expectation. The file is unavailable if the replica is unavailable
and fewer than k erasure-coded packets are available, which
happens with probability (1 - a) · U_ideal(n, k).
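
The three unavailability expressions above translate directly into code. This is a small sketch assuming independent node availability with probability a, as in the model; the function names are hypothetical, and `math.comb` requires Python 3.8 or later.

```python
from math import comb

def u_replication(R, a):
    """Probability that none of R replicas is available."""
    return (1 - a) ** R

def u_ideal(n, k, a):
    """Probability that fewer than k of n coded fragments are available."""
    return sum(comb(n, i) * a**i * (1 - a) ** (n - i) for i in range(k))

def u_hybrid(n, k, a):
    """Replica unavailable AND fewer than k fragments available."""
    return (1 - a) * u_ideal(n, k, a)

# Same total storage of R = 2 file sizes, node availability a = 0.5.
print(u_replication(2, 0.5))   # two full replicas
print(u_ideal(4, 2, 0.5))      # (4, 2) erasure code, n = k * R
print(u_hybrid(2, 2, 0.5))     # one replica plus a (2, 2) code, n = k * (R - 1)
```

Setting k = 1 in `u_ideal` recovers the replication formula, since a (n, 1) code is available whenever at least one of the n fragments is.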

Minimum-Storage Regenerating Codes: An (n, k) MSR
Code with redundancy R = n/k stores R · M bytes in total, so
f · R · M bytes must be replaced per unit time. We will refer
to the overhead of an MSR code δ_MSR as the extra amount
of information that needs to be transferred compared to the
fragment size M/k:

$$
\delta_{MSR} \triangleq \frac{(n - 1)\,\beta_{MSR}}{M/k} = \frac{n - 1}{n - k}. \tag{7}
$$

Therefore, replacing a fragment requires transferring over the
network δ_MSR times the size of the fragment in the most
favorable case when newcomers connect to d = n - 1
nodes to construct a new fragment. This results in
f · R · M · δ_MSR bytes sent per unit time, and unavailability
U_ideal(n, k).

¹ In addition to evaluating a larger set of strategies and using a somewhat
different set of traces, we count bandwidth cost due to permanent node failure
only, rather than both failures and joins. Most designs [4], [31], [33] can avoid
reacting to node joins. Additionally, we compute probabilities directly rather
than using approximations to the binomial.


Minimum-Bandwidth Regenerating Codes:
It is convenient to define the MBR code overhead as the
amount of information transferred over the ideal fragment size:

$$
\delta_{MBR} \triangleq \frac{(n - 1)\,\beta_{MBR}}{M/k} = \frac{2(n - 1)}{2n - k - 1}. \tag{8}
$$

Therefore, an (n, k) MBR Code stores M · R · δ_MBR bytes in
total, where R = n/k (each of the n nodes stores δ_MBR · M/k
bytes). So in expectation f · M · R · δ_MBR bytes are transferred
per unit time, and the unavailability is again U_ideal(n, k).
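
The overhead factors and the resulting maintenance bandwidth can be tabulated for given parameters. The sketch below assumes d = n - 1 helpers as in equations (7) and (8), and uses exact fractions; `delta_msr`, `delta_mbr`, and `maintenance_bandwidth` are hypothetical helper names. For MBR codes the total storage is computed as n nodes times the per-node storage δ_MBR · M/k, which follows from equation (6).

```python
from fractions import Fraction

def delta_msr(n, k):
    """Repair traffic relative to the fragment size M/k, equation (7)."""
    return Fraction(n - 1, n - k)

def delta_mbr(n, k):
    """Repair traffic relative to the ideal fragment size, equation (8)."""
    return Fraction(2 * (n - 1), 2 * n - k - 1)

def maintenance_bandwidth(scheme, n, k, M, f):
    """Expected bytes transferred per unit time for failure rate f."""
    R = Fraction(n, k)  # redundancy factor of the erasure code
    if scheme == "ideal":
        return f * R * M
    if scheme == "msr":
        return f * R * M * delta_msr(n, k)
    if scheme == "mbr":
        stored = n * delta_mbr(n, k) * Fraction(M, k)  # n nodes, alpha_MBR each
        return f * stored
    raise ValueError(f"unknown scheme: {scheme}")

f = Fraction(1, 100)  # illustrative failure rate, not taken from the traces
for scheme in ("ideal", "msr", "mbr"):
    print(scheme, maintenance_bandwidth(scheme, n=14, k=7, M=1, f=f))
```

For (n, k) = (14, 7) the overheads are 13/7 for MSR and 13/10 for MBR, so at equal n and k an MBR code transfers less than an MSR code but stores 30% more than an ideal code.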

C. Estimating f and a 

In this section we describe how we estimate f, the fraction
of nodes that permanently fail per unit time, and a, the mean 
node availability, based on traces of node availability in several 
distributed systems. 

We use four traces of node availability with widely varying
characteristics, summarized in Table I. The PlanetLab All
Pairs Ping [36] trace is based on pings sent every 15 minutes
between all pairs of 200-400 nodes in PlanetLab, a stable,
managed network research testbed. We consider a node to
be up in one 15-minute interval when at least half of the
pings sent to it in that interval succeeded. In a number
of periods, all or nearly all PlanetLab nodes were down,
most likely due to planned system upgrades or measurement
errors. To exclude these cases, we "cleaned" the trace as
follows: for each period of downtime at a particular node,
we remove that period (i.e., we consider the node up during
that interval) when the average number of nodes up during
that period is less than half the average number of nodes up
over all time. The Microsoft PCs [28] trace is derived from
hourly pings to desktop PCs within Microsoft Corporation.
The Skype superpeers [37] trace is based on application-level
pings at 30-minute intervals to nodes in the Skype superpeer
network, which may approximate the behavior of a set of
well-provisioned endhosts, since superpeers may be selected
in part based on bandwidth availability [37]. Finally, the trace
of Gnutella peers [38] is based on application-level pings to
ordinary Gnutella peers at 7-minute intervals.

We next describe how we derive f and a from these
traces. It is of key importance for the storage system to
distinguish between permanent and transient failures (defined
in Section IV-A), since only the former requires
bandwidth-intensive replacement of lost redundancy. Most systems use a
timeout heuristic: when a node has not responded to
network-level probes after some period of time t, it is considered to
have failed permanently. To approximate a storage system's
behavior, we use the same heuristic. Node availability a is
then calculated as the mean (over time) fraction of nodes
which were available among those which were not considered
permanently failed at that time.
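
The timeout heuristic can be sketched on a toy trace. Everything beyond the heuristic itself is an assumption of this illustration: the trace is a list of per-node up/down sequences sampled at fixed intervals, the timeout is expressed in intervals, f is normalized as declared failures per live node-interval, and a node that comes back after being declared failed is simply counted as live again. The function name `estimate_f_and_a` is hypothetical.

```python
def estimate_f_and_a(trace, timeout):
    """Estimate (f, a) from per-node boolean availability sequences.

    trace[i][t] is True when node i answered probes in interval t.
    A node is treated as permanently failed once it has been down for
    `timeout` consecutive intervals.
    """
    num_intervals = len(trace[0])
    down_run = [0] * len(trace)   # current streak of down intervals per node
    failures = 0                  # number of permanent-failure declarations
    live_node_intervals = 0       # (node, interval) pairs not considered failed
    fractions_up = []             # per-interval fraction of live nodes that are up

    for t in range(num_intervals):
        live = up = 0
        for i, series in enumerate(trace):
            if series[t]:
                down_run[i] = 0
            else:
                down_run[i] += 1
                if down_run[i] == timeout:
                    failures += 1          # timeout just expired
            if down_run[i] < timeout:      # not considered permanently failed
                live += 1
                up += series[t]
        if live:
            fractions_up.append(up / live)
            live_node_intervals += live

    a = sum(fractions_up) / len(fractions_up)
    f = failures / live_node_intervals
    return f, a

toy_trace = [
    [True, True, True, True],      # always up
    [True, False, False, False],   # goes down and never returns
]
print(estimate_f_and_a(toy_trace, timeout=2))
```

In the toy trace the second node counts as transiently down for one interval and as permanently failed afterwards, so it lowers a only during that first interval.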

The resulting values of f and a appear in Table I, where
we have fixed the timeout t at 1 day. Longer timeouts
reduce overall bandwidth costs [10], [33], but begin to impact
durability [?] and are more likely to produce artificial effects
in the short (2.5-day) Gnutella trace.

We emphasize that the procedure described above only
provides an estimate of f and a, which may be biased in
several ways. Some designs [33] reincorporate data on nodes
which return after transient failures which were longer than the
timeout t, which would reduce f. Additionally, even placing
files on uniform-random nodes results in selecting nodes that
are more available [34] and less prone to failure [35] than
the average node. Finally, we have not accounted for the
time needed to transfer data onto a node, during which it
is effectively unavailable. However, we consider it unlikely
that these biases would impact our main results since we
are primarily concerned with the relative performance of the
strategies we compare.

D. Quantitative results 

Figure 5 shows the tradeoff between mean unavailability 
and mean maintenance bandwidth in each of the strategies of 
Section IV-B using the values of f and a from Section IV-C 
and k = 7. Feasible points in the tradeoff space are produced 
by varying the redundancy factor $\mathcal{R}$. The marked points along 
each curve highlight a subset of the feasible points (i.e., points 
for which n is integral). 

Figure 6 shows that the relative performance of the various 
strategies is similar for k = 14. 

For conciseness, we omit plots of storage used by the 
schemes. However, disk usage is proportional to bandwidth 
for all schemes we evaluate in this section, with the exception 
of minimum storage regenerating codes. This is because MSR 
codes are the only scheme in which the data transferred onto a 
newcomer is not equal to the amount of data that the newcomer 
finally stores. Instead, the storage used by MSR codes is equal 
to that of the storage used by hypothetical ideal erasure codes, 
and hence MSR codes' space usage is proportional to the 
bandwidth used by ideal codes. 

For example, from Figure 5(b) we can compare the 
strategies at their feasible points closest to unavailability 0.0001, 
i.e., four nines of availability. At these points, MSR codes use 
about 44% more bandwidth and 28% less storage space than 
Hybrid, while MBR codes use about 3.7% less bandwidth and 
storage space than Hybrid. Additionally, these feasible points 
give MSR and MBR codes somewhat better unavailability than 
Hybrid (0.000059 vs. 0.00018). 

One interesting effect apparent in the plots is that MSR 
codes' maintenance bandwidth actually decreases as the 
redundancy factor $\mathcal{R}$ increases, before coming to a minimum and 
then increasing again. Intuitively, while increasing $\mathcal{R}$ increases 
the total amount of data that needs to be maintained, for small 
$\mathcal{R}$ this is more than compensated for by the reduction in 
overhead. The expected maintenance bandwidth per unit time 
Trace          Length   Start date      Mean #     f (fraction       a
               (days)                   nodes up   failed per day)
PlanetLab      527      Jan. 2004       303        0.017             0.97
Microsoft PCs  35       Jul. 6, 1999    41970      0.038             0.91
Skype          25       Sept. 12, 2005  710        0.12              0.65
Gnutella       2.5      May, 2001       1846       0.30              0.38

TABLE I 

The availability traces used in this paper. 
Fig. 5. Availability-bandwidth tradeoff for k = 7 with parameters derived from each of the traces. The key in (d) applies to all four plots. 



is 

$$ f \mathcal{M} \mathcal{R}\, \delta_{MSR} = f \mathcal{M}\, \frac{n}{k} \cdot \frac{n-1}{n-k}. \tag{9} $$

It is easy to see that this function is minimized by selecting n 
to be one of the two integers closest to 

$$ n_{opt} = k + \sqrt{k^2 - k}, \tag{10} $$

which approaches a redundancy factor of 2 as $k \to \infty$. 
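
As a quick numerical check of (9) and (10), the sketch below searches over integer n for the value that minimizes the factor (n/k)(n-1)/(n-k) and compares it with $n_{opt}$. It is an illustration only: the function names and the search limit `n_max` are assumptions made here, and f and $\mathcal{M}$ are left out because they only scale the result.

```python
from fractions import Fraction
import math


def msr_factor(n, k):
    """Exact value of (n/k) * (n-1)/(n-k), i.e. (9) divided by f*M."""
    return Fraction(n * (n - 1), k * (n - k))


def best_integer_n(k, n_max=None):
    """Smallest integer n in (k, n_max] that minimizes msr_factor (brute force)."""
    if n_max is None:
        n_max = 10 * k  # assumed search limit, ample for this illustration
    return min(range(k + 1, n_max + 1), key=lambda n: msr_factor(n, k))


def n_opt(k):
    """Real-valued minimizer from (10)."""
    return k + math.sqrt(k * k - k)


if __name__ == "__main__":
    for k in (7, 14):
        n = best_integer_n(k)
        print(k, n, round(n_opt(k), 2), float(msr_factor(n, k)))
```

For k of at least 2, the two integers nearest to $n_{opt}$ are 2k-1 and 2k, and both give the same value 2(2k-1)/k of the factor, so the search reports the smaller of the two.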

E. Qualitative comparison 

In this section we discuss two questions: First, based on 
the results of the previous section, what are the qualitative 
advantages and disadvantages of the two extremal regenerating 
codes compared with the Hybrid coding scheme? Second, do 
our results affect the conclusion of Rodrigues and Liskov [10] 
that erasure codes offer too little improvement in bandwidth 
use to clearly offset the added complexity that they add to the 
system? 

1) Comparison with Hybrid: Compared with Hybrid, for a 
given target availability, minimum bandwidth regenerating (MBR) codes 
offer slightly lower maintenance bandwidth and storage, and a 
simpler system architecture since only one type of redundancy 
needs to be maintained. An important practical disadvantage 
of the Hybrid scheme is its asymmetric design, which can 
cause disk I/O to become the bottleneck of the system 
during repairs. This is because the disk that stores the full replica 
and generates the encoded fragments needs to read the whole 
data object in order to compute each encoded fragment. 

However, MBR codes have at least two disadvantages. First, 
constructing a new packet, or reconstructing the entire file, 
requires communication with $n-1$ nodes (footnote 2) rather than one 

2 The scheme could be adapted to connect to fewer than $n-1$ nodes, but 
this would increase maintenance bandwidth. 
Fig. 6. Availability-bandwidth tradeoff for k = 14 with parameters derived from each of the traces. 



(in Hybrid, the node holding the single replica). This adds 
overhead that could be significant for sufficiently small files or 
sufficiently large n. Perhaps more importantly, there is a factor 
$\delta_{MBR}$ increase in total data transferred to read the file, roughly 
30% for a redundancy factor $\mathcal{R} = 2$ and k = 7, or 13% for 
$\mathcal{R} = 4$. Thus, if the frequency with which a file is read is sufficiently 
high and k is sufficiently small, this inefficiency could become 
unacceptable. Again compared with Hybrid, MSR codes offer 
a simpler, symmetric system design and somewhat lower 
storage space for the same reliability. However, MSR codes 
have somewhat higher maintenance bandwidth and, like MBR 
codes, require that newcomers and data collectors connect to 
multiple nodes. 
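
The 30% and 13% figures quoted above can be reproduced with a few lines of arithmetic. The sketch assumes that the MBR overhead factor is $\delta_{MBR} = 2(n-1)/(2n-k-1)$, which is what the minimum-bandwidth point of the storage-bandwidth tradeoff gives for $d = n-1$; the formula is not restated in this section, so treat it, along with the function names, as an assumption of this example.

```python
from fractions import Fraction


def delta_mbr(n, k):
    """Assumed MBR overhead factor 2(n-1)/(2n-k-1), relative to M/k per node."""
    return Fraction(2 * (n - 1), 2 * n - k - 1)


def read_overhead_percent(redundancy, k):
    """Extra data (in percent) transferred to read a file from k MBR nodes."""
    n = redundancy * k  # n = R * k storage nodes, R assumed integral here
    return float((delta_mbr(n, k) - 1) * 100)


if __name__ == "__main__":
    print(read_overhead_percent(2, 7))  # redundancy factor 2, k = 7
    print(read_overhead_percent(4, 7))  # redundancy factor 4, k = 7
```

With k = 7 this yields exactly 30% for a redundancy factor of 2 and 12.5% for a redundancy factor of 4, the latter matching the quoted 13% after rounding.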

Rodrigues et al. [10] discussed two principal 
disadvantages of using erasure codes in a widely distributed system: 
coding (in particular, the Hybrid strategy) complicates the 
system architecture; and the improvement in maintenance 
bandwidth was minimal in more stable environments, which 
are the more likely deployment scenario. Regenerating codes 
address the first of these issues, which may make coding more 
broadly applicable. 

V. Conclusions 

We presented a general theoretic framework that can 
determine the information that must be communicated to repair 
failures in encoded systems and identified a tradeoff between 
storage and repair bandwidth. 

Certainly there are many issues that remain to be addressed 
before these ideas can be implemented in practical systems. 
In future work we plan to investigate deterministic designs 
of regenerating codes over small finite fields, the existence 
of systematic regenerating codes, designs that minimize the 
overhead storage of the coefficients, as well as the impact of 
node dynamics on reliability. Other issues of interest involve 
how CPU processing and disk I/O will influence the system 
performance, as well as integrity and security for the linear 
combination packets (see [39] for a related analysis for content 
distribution). 

One potential application for the proposed regenerating 
codes is distributed archival storage or backup, which might 
be useful for data center applications. In this case, files are 
likely to be large and infrequently read, making the 
drawbacks mentioned above less significant, so that MBR codes' 
symmetric design may make them a win over Hybrid; and 
the required reliability may also be high, making them a win 
over simple replication. In other applications (such as storage 
systems within fast local networks) the required storage may 
become important, and the results of the previous section show 
that minimum storage regenerating codes can be useful. 
VI. Appendix 

Here we prove Theorem 1. We start with the following 
simple lemma. 

Lemma 1: No data collector DC can reconstruct the initial 
data object if the minimum cut in $\mathcal{G}$ between S and DC is 
smaller than the initial object size $\mathcal{M}$. 

Proof: The information of the initial data object must be 
communicated from the source to the particular data collector. 
Since every link in the information flow graph can only be 
used at most once, and since the point-to-point capacity is 
less than the data object size, a standard cut-set bound shows 
that the entropy of the data object conditioned on everything 
observable to the data collector is non-zero and therefore 
reconstruction is impossible. ■ 

The information flow graph casts the original storage 
problem as a network communication problem where the source 
S multicasts the file to the set of all possible data collectors. 
By analyzing the connectivity in the information flow graph, 
we obtain necessary conditions for all possible storage codes, 
as shown in Lemma 1. In addition to providing necessary 
conditions for all codes, the information flow graph can also 
imply the existence of codes under proper assumptions. 

Proposition 1: Consider any given finite information flow 
graph $\mathcal{G}$, with a finite set of data collectors. If the minimum 
of the min-cuts separating the source from each data collector 
is larger than or equal to the data object size $\mathcal{M}$, then there exists a 
linear network code defined over a sufficiently large finite field 
F (whose size depends on the graph size) such that all data 
collectors can recover the data object. Further, randomized 
network coding guarantees that all collectors can recover the 
data object with probability that can be driven arbitrarily high 
by increasing the field size. 
Proof: The key point is observing that the reconstruction 
problem reduces exactly to multicasting on all the possible 
data collectors on the information flow graph $\mathcal{G}$. Therefore, the 
result follows directly from the constructive results in network 
coding theory for single source multicasting; see the discussion 
of related works on network coding in Section II-B. ■ 

To apply Proposition 1, consider an information flow graph 
$\mathcal{G}$ that enumerates all possible failure/repair patterns and all 
possible data collectors when the number of failures/repairs 
is bounded. Proposition 1 then implies that there exists a valid regenerating 
code achieving the necessary cut bound (cf. Lemma 1), which 
can tolerate a bounded number of failures/repairs. In another 
paper 13, we present coding methods that construct determin- 
istic regenerating codes that can tolerate infinite number of 
failures/repairs, with a bounded field size, assuming only the 
population of active nodes at any time is bounded. For the 
detailed coding theoretic construction, please refer to (2). 

We analyze the connectivity in the information flow graph 
to find the minimum repair bandwidth. The next key lemma 
characterizes the flow in any information flow graph, under 
an arbitrary failure pattern and connectivity. 

Lemma 2: Consider any (potentially infinite) information 
flow graph G, formed by having n initial nodes that connect 
directly to the source and obtain $\alpha$ bits, while additional nodes 
join the graph by connecting to d existing nodes and obtaining 
$\beta$ bits from each (footnote 3). Any data collector t that connects to a 
k-subset of "out-nodes" (cf. Figure 3) of G must satisfy: 

$$ \text{mincut}(s,t) \ge \sum_{i=0}^{\min\{d,k\}-1} \min\{(d-i)\beta, \alpha\}. \tag{11} $$


Fig. 7. $G^*$ used in the proof of Lemma 2. 

Proof: First, we show that there exists an information flow 
graph $G^*$ where the bound (11) is matched with equality. This 
graph is illustrated by Figure 7. In this graph, there are initially 
n nodes labeled from 1 to n. Consider k newcomers labeled as 
$n+1, \dots, n+k$. The newcomer node $n+i$ connects to nodes 
$n+i-d, \dots, n+i-1$. Consider a data collector t that connects 
to the last k nodes, i.e., nodes $n+1, \dots, n+k$. Consider a 
cut $(U, \bar{U})$ defined as follows. For each $i \in \{1, \dots, k\}$, if 
$\alpha \le (d-i+1)\beta$, then we include $x_{out}^{n+i}$ in $\bar{U}$; otherwise, we 
include $x_{out}^{n+i}$ and $x_{in}^{n+i}$ in $\bar{U}$. (Newcomer $n+i$ receives 
$\max\{d-i+1, 0\}$ of its d incoming edges from the initial nodes, 
all of which lie in U.) Then this cut $(U, \bar{U})$ achieves 
(11) with equality. 

3 Note that this setup allows more graphs than those in $\mathcal{G}(n, k, d, \alpha, \beta)$. In 
a graph in $\mathcal{G}(n, k, d, \alpha, \beta)$, at any time there are n active storage nodes and 
a newcomer can only connect to the active nodes. In contrast, in a graph G 
described in this lemma, there is no notion of "active nodes" and a newcomer 
can connect to any d existing nodes. 

We now show that (11) must be satisfied for any G formed 
by adding nodes of in-degree d as described above. Consider a 
data collector t that connects to a k-subset of "out-nodes", say 
$\{x_{out}^i : i \in I\}$. We want to show that any s-t cut in G has 
capacity at least 

$$ \sum_{i=0}^{\min\{d,k\}-1} \min\{(d-i)\beta, \alpha\}. \tag{12} $$

Since the incoming edges of t all have infinite capacity, we 
only need to examine the cuts $(U, \bar{U})$ with $s \in U$ and 

$$ x_{out}^i \in \bar{U}, \quad \forall i \in I. \tag{13} $$

Let $\mathcal{C}$ denote the edges in the cut, i.e., the set of edges going 
from $U$ to $\bar{U}$. 

Every directed acyclic graph has a topological sorting (see, 
e.g., [40]), where a topological sorting (or acyclic ordering) is 
an ordering of its vertices such that the existence of an edge 
from $v_i$ to $v_j$ implies $i < j$. Let $x_{out}^1$ be the topologically first 
output node in $\bar{U}$. Consider two cases: 

• If $x_{in}^1 \in U$, then the edge $x_{in}^1 x_{out}^1$ must be in $\mathcal{C}$. 

• If $x_{in}^1 \in \bar{U}$, since $x_{in}^1$ has an in-degree of d and it is the 
topologically first node in $\bar{U}$, all the incoming edges of 
$x_{in}^1$ must be in $\mathcal{C}$. 

Therefore, these edges related to $x_{out}^1$ will contribute a value 
of $\min\{d\beta, \alpha\}$ to the cut capacity. 

Now consider $x_{out}^2$, the topologically second output node in 
$\bar{U}$. Similar to the above, we have two cases: 

• If $x_{in}^2 \in U$, then the edge $x_{in}^2 x_{out}^2$ must be in $\mathcal{C}$. 

• If $x_{in}^2 \in \bar{U}$, since at most one of the incoming edges of 
$x_{in}^2$ can be from $x_{out}^1$, at least $d-1$ incoming edges of $x_{in}^2$ must 
be in $\mathcal{C}$. 

Following the same reasoning we find that for the i-th node 
($i = 0, \dots, \min\{d,k\}-1$) in the sorted set $\bar{U}$, either one edge 
of capacity $\alpha$ or $(d-i)$ edges of capacity $\beta$ must be in $\mathcal{C}$. 
Equation (11) is exactly summing these contributions. ■ 

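
The first half of the proof can be checked numerically on small instances: build the graph $G^*$ described above, compute the maximum s-t flow, and compare it with the right-hand side of (11). The sketch below is an illustration only. It uses the standard Edmonds-Karp algorithm (not something the proof itself relies on), integer capacities, and a large finite constant `INF` in place of the infinite-capacity edges; all function and variable names are assumptions made here.

```python
from collections import deque, defaultdict

INF = 10**9  # stands in for the infinite-capacity edges (assumption)


def cut_bound(k, d, alpha, beta):
    """Right-hand side of (11)."""
    return sum(min((d - i) * beta, alpha) for i in range(min(d, k)))


def build_g_star(n, k, d, alpha, beta):
    """Edge capacities of G*: n initial nodes, then newcomers n+1..n+k."""
    assert d <= n, "newcomer n+1 needs d existing nodes"
    cap = defaultdict(lambda: defaultdict(int))
    for j in range(1, n + k + 1):
        cap[("in", j)][("out", j)] = alpha         # storage edge of node j
    for j in range(1, n + 1):
        cap["s"][("in", j)] = INF                  # source feeds initial nodes
    for i in range(1, k + 1):
        for m in range(n + i - d, n + i):          # nodes n+i-d .. n+i-1
            cap[("out", m)][("in", n + i)] = beta  # repair download
        cap[("out", n + i)]["t"] = INF             # data collector edge
    return cap


def max_flow(cap, s, t):
    """Edmonds-Karp: augment along shortest paths in the residual graph."""
    flow = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in list(cap[u].items()):
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow                            # no augmenting path left
        bottleneck, v = INF, t
        while parent[v] is not None:               # find the path bottleneck
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:               # update residual capacities
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u
        flow += bottleneck


if __name__ == "__main__":
    n, k, d, alpha, beta = 4, 2, 3, 2, 1           # illustrative values
    print(max_flow(build_g_star(n, k, d, alpha, beta), "s", "t"),
          cut_bound(k, d, alpha, beta))
```

By the max-flow min-cut theorem the two printed numbers coincide whenever the min-cut of $G^*$ meets (11) with equality, which is what the lemma asserts.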
From Lemma 2, we know that there exists a 
graph $G^* \in \mathcal{G}(n, k, d, \alpha, \beta)$ whose mincut is exactly 
$\sum_{i=0}^{\min\{d,k\}-1} \min\{(d-i)\beta, \alpha\}$. This implies that if we 
want to ensure recoverability while allowing a newcomer to 
connect to any set of d existing nodes, then the following is 
a necessary condition (footnote 4): 

$$ \sum_{i=0}^{\min\{d,k\}-1} \min\{(d-i)\beta, \alpha\} \ge \mathcal{M}. \tag{14} $$

Furthermore, when this condition is satisfied, we know any 
graph in $\mathcal{G}(n, k, d, \alpha, \beta)$ will have enough flow from the source 
to each data collector. For this reason, we say 

$$ C \triangleq \sum_{i=0}^{\min\{d,k\}-1} \min\{(d-i)\beta, \alpha\} \tag{15} $$

is the capacity for $(n, k, d, \alpha, \beta)$ regenerating codes (where 
each newcomer can access any arbitrary set of d nodes). 

4 This, however, does not rule out the possibility that the mincut is larger 
if a newcomer can choose the d existing nodes to connect to. We leave this 
as future work. 
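
As a small worked instance of (14) and (15), take the illustrative values $d = 3$, $k = 2$, $\alpha = 2$, $\beta = 1$ (chosen here for the example, not taken from the evaluation):

```latex
C = \sum_{i=0}^{1} \min\{(3-i)\cdot 1,\; 2\}
  = \min\{3, 2\} + \min\{2, 2\}
  = 2 + 2 = 4 .
```

Condition (14) therefore holds for any data object of size $\mathcal{M} \le 4$. Reducing the per-node storage to $\alpha = 1$ with the same $\beta$ gives $C = 1 + 1 = 2$, so in this regime storage rather than repair bandwidth limits the capacity.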

Note that if $d < k$, requiring any d storage nodes to have 
a flow of $\mathcal{M}$ will lead to the same condition (cf. (14)) as 
requiring any k storage nodes to have a flow of $\mathcal{M}$. Hence in 
such a case, we might as well set k as d. For this reason, in 
the following we assume $d \ge k$ without loss of generality. 

We are interested in characterizing the achievable tradeoffs 
between the storage $\alpha$ and the repair bandwidth $d\beta$. To derive 
the optimal tradeoffs, we can fix the repair bandwidth and 
solve for the minimum $\alpha$ such that (14) is satisfied. Recall 
that $\gamma = d\beta$ is the total repair bandwidth, and the parameters 
$(n, k, d, \alpha, \gamma)$ can be used to characterize the system. We are 
interested in finding the whole region of feasible points $(\alpha, \gamma)$ 
and then selecting the one that minimizes storage $\alpha$ or repair 
bandwidth $\gamma$. Consider fixing both $\gamma$ and d (to some integer 
value) and minimizing $\alpha$: 

$$ \alpha^*(d, \gamma) \triangleq \min \alpha \quad \text{subject to: } \sum_{i=0}^{k-1} \min\left\{\left(1 - \frac{i}{d}\right)\gamma,\; \alpha\right\} \ge \mathcal{M}. \tag{16} $$


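Now observe that the dependence on d must be monotone: 

$$ \alpha^*(d+1, \gamma) \le \alpha^*(d, \gamma). \tag{17} $$

This is because $\alpha^*(d, \gamma)$ is always a feasible solution for the 
optimization for $\alpha^*(d+1, \gamma)$: each term $(1 - i/d)\gamma$ in the 
constraint of (16) does not decrease when d grows. Hence a larger d always implies 
a better storage-repair bandwidth tradeoff. 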

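The optimization (16) can be explicitly solved. We call the 
solution the threshold function $\alpha^*(d, \gamma)$, which for a fixed d 
is piecewise linear: 

$$ \alpha^*(d, \gamma) = \begin{cases} \dfrac{\mathcal{M}}{k}, & \gamma \in [f(0), +\infty) \\[2mm] \dfrac{\mathcal{M} - g(i)\gamma}{k-i}, & \gamma \in [f(i), f(i-1)),\ i = 1, \dots, k-1, \end{cases} \tag{18} $$

where 

$$ f(i) \triangleq \frac{2\mathcal{M}d}{(2k-i-1)i + 2k(d-k+1)}, \tag{19} $$

$$ g(i) \triangleq \frac{(2d-2k+i+1)i}{2d}. \tag{20} $$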
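
The threshold function can be evaluated directly from its closed form. The following sketch is an illustration under stated assumptions: the function names are made up here, exact rational arithmetic is used to avoid rounding at the interval boundaries, and returning `None` for $\gamma < f(k-1)$ reflects the fact that no $\alpha$ can satisfy the constraint of the optimization below that repair bandwidth.

```python
from fractions import Fraction


def f(i, M, k, d):
    """Breakpoint f(i) of the threshold function."""
    return Fraction(2 * M * d, (2 * k - i - 1) * i + 2 * k * (d - k + 1))


def g(i, k, d):
    """Slope coefficient g(i) of the threshold function."""
    return Fraction((2 * d - 2 * k + i + 1) * i, 2 * d)


def alpha_star(M, k, d, gamma):
    """Minimum storage per node for total repair bandwidth gamma (d >= k)."""
    gamma = Fraction(gamma)
    if gamma >= f(0, M, k, d):
        return Fraction(M, k)                      # minimum-storage regime
    for i in range(1, k):
        if f(i, M, k, d) <= gamma < f(i - 1, M, k, d):
            return (M - g(i, k, d) * gamma) / (k - i)
    return None                                    # gamma < f(k-1): infeasible


if __name__ == "__main__":
    M, k, d = 1, 5, 9                              # illustrative values
    msr_gamma = f(0, M, k, d)                      # minimum-storage point
    mbr_gamma = f(k - 1, M, k, d)                  # minimum-bandwidth point
    print(msr_gamma, alpha_star(M, k, d, msr_gamma))
    print(mbr_gamma, alpha_star(M, k, d, mbr_gamma))
```

For the illustrative values M = 1, k = 5, d = 9, the minimum-storage point is $(\alpha, \gamma) = (1/5, 9/25)$ and the minimum-bandwidth point is $(9/35, 9/35)$, where the amount stored equals the amount downloaded during repair.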
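The last part of the proof involves showing that the threshold 
function is the solution of this optimization. To simplify 
notation, introduce 

$$ b_i \triangleq \left(1 - \frac{k-1-i}{d}\right)\gamma, \quad \text{for } i = 0, \dots, k-1. \tag{21} $$

Then the problem is to minimize $\alpha$ subject to the constraint: 

$$ \sum_{i=0}^{k-1} \min\{b_i, \alpha\} \ge B, \tag{22} $$

where B denotes the data object size $\mathcal{M}$ of (16), and the $b_i$ are 
the terms $(1 - i/d)\gamma$ of (16) listed in increasing order. 
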

The left hand side of (22), as a function of $\alpha$, is a 
piecewise-linear function: 

$$ C(\alpha) = \begin{cases} k\alpha, & \alpha \in [0, b_0] \\ b_0 + (k-1)\alpha, & \alpha \in (b_0, b_1] \\ \quad\vdots & \quad\vdots \\ b_0 + \dots + b_{k-2} + \alpha, & \alpha \in (b_{k-2}, b_{k-1}] \\ b_0 + \dots + b_{k-1}, & \alpha \in (b_{k-1}, \infty). \end{cases} \tag{23} $$




Note from this expression that $C(\alpha)$ is strictly increasing from 
0 to its maximum value $b_0 + \dots + b_{k-1}$ as $\alpha$ increases from 
0 to $b_{k-1}$. To find the minimum $\alpha$ such that $C(\alpha) \ge B$, we 
simply let $\alpha^* = C^{-1}(B)$ if $B \le b_0 + \dots + b_{k-1}$: 

$$ \alpha^* = \begin{cases} \dfrac{B}{k}, & B \in [0, k b_0] \\[2mm] \dfrac{B - b_0}{k-1}, & B \in (k b_0,\; b_0 + (k-1) b_1] \\ \quad\vdots & \quad\vdots \\ B - \sum_{j=0}^{k-2} b_j, & B \in \left(\sum_{j=0}^{k-2} b_j + b_{k-2},\; \sum_{j=0}^{k-1} b_j\right]. \end{cases} \tag{24} $$
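
The inversion of the piecewise-linear function can also be carried out numerically, which gives an independent way to obtain the minimum $\alpha$ without the closed form. The sketch below is illustrative only; the function names are assumptions made here, and exact rational arithmetic is used so that the interval boundaries are handled without rounding.

```python
from fractions import Fraction


def b_values(k, d, gamma):
    """The increasing sequence b_0, ..., b_{k-1} from (21)."""
    gamma = Fraction(gamma)
    return [(1 - Fraction(k - 1 - i, d)) * gamma for i in range(k)]


def flow(alpha, k, d, gamma):
    """Left-hand side of (22): C(alpha) = sum_i min{b_i, alpha}."""
    return sum(min(b, alpha) for b in b_values(k, d, gamma))


def min_alpha(B, k, d, gamma):
    """Smallest alpha with C(alpha) >= B, or None if B exceeds max C."""
    B = Fraction(B)
    bs = b_values(k, d, gamma)
    if B > sum(bs):
        return None                      # constraint (22) cannot be met
    prefix = Fraction(0)                 # b_0 + ... + b_{i-1}
    for i in range(k):
        # On the i-th piece, C(alpha) = prefix + (k - i) * alpha.
        alpha = (B - prefix) / (k - i)
        if alpha <= bs[i]:               # solution lies on this piece
            return alpha
        prefix += bs[i]
    return None                          # not reached when B <= sum(bs)


if __name__ == "__main__":
    k, d, B = 5, 9, 1                    # illustrative values
    for gamma in (Fraction(9, 25), Fraction(3, 10), Fraction(9, 35)):
        a = min_alpha(B, k, d, gamma)
        print(gamma, a, flow(a, k, d, gamma))
```

For each of the three sample values of $\gamma$, the printed flow equals B, confirming that the returned $\alpha$ sits exactly on the boundary of the constraint.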

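For $i = 1, \dots, k-1$, the i-th condition in the above expression 
is: 

$$ \alpha^* = \frac{B - \sum_{j=0}^{i-1} b_j}{k-i}, \quad \text{for } B \in \left(\sum_{j=0}^{i-1} b_j + (k-i) b_{i-1},\; \sum_{j=0}^{i} b_j + (k-i-1) b_i\right]. $$

Note from the definition of $\{b_i\}$ in (21) that 

$$ \sum_{j=0}^{i-1} b_j = \gamma\, \frac{(2d-2k+i+1)\,i}{2d} = \gamma\, g(i), $$

and 

$$ \sum_{j=0}^{i} b_j + (k-i-1) b_i = \gamma (i+1)\, \frac{2d-2k+i+2}{2d} + (k-i-1)\,\gamma \left(1 - \frac{k-1-i}{d}\right) = \gamma\, \frac{2ik - i^2 - i + 2k + 2kd - 2k^2}{2d} = \gamma\, \frac{B}{f(i)}, $$

where $f(i)$ and $g(i)$ are defined in (19) and (20). Hence we have: 

$$ \alpha^* = \frac{B - g(i)\gamma}{k-i}, \quad \text{for } B \in \left(\frac{\gamma B}{f(i-1)},\; \frac{\gamma B}{f(i)}\right], $$

and the condition on B is equivalent to $\gamma \in [f(i), f(i-1))$. 
The expression of $\alpha^*(d, \gamma)$ then follows. 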



